
Conversation

@akeaswaran
Collaborator

@akeaswaran akeaswaran commented Jan 4, 2026

Summary by CodeRabbit

  • New Features

    • Play descriptions are now cleaned and normalized and exposed as a public helper to improve parsing accuracy.
    • Yardage parsing enhanced to recognize more rushing and passing text patterns.
  • Bug Fixes

    • Week validation widened to accept week 16 (fix impacting play-by-play processing).
  • Behavior Changes

    • Several play/query functions now default season_type to "both" instead of "regular".
  • Tests / Chores

    • Added 2025 PBP yardage tests, updated play-type/list expectations, and bumped release to v2.2.0.


akeaswaran and others added 18 commits September 3, 2025 14:01
Removed deprecated labels for elapsed time columns in cfbd_drives and added new columns to cfbd_live_plays documentation and tests. Updated test expectations to use expect_in instead of expect_setequal for column checks.
Added new columns to cfbd_play_stats_player output, improved sack player aggregation, handled NULL values, and updated documentation and examples to reflect changes. Also updated cfbd_live_plays documentation to include new columns for average start yard line and deserve to win metrics.
Changed cfbd_pbp_data to assign 3 timeouts per half for offense and defense when timeout data is missing from the API. Updated documentation and examples to reflect this behavior.
Added .groups = "drop" to the dplyr::summarise call in add_play_counts to control grouping behavior and prevent potential warnings in future dplyr versions.
Removed the specific count of variables from the return value description in both R and Rd files to improve maintainability and accuracy as the data frame structure may change.
… function usage in play data functions

Corrected spacing and replaced superseded `dplyr::distinct_all()` with `dplyr::distinct()`, and standardized assignment spacing for improved code readability and consistency.
Added a check to filter out games with fewer than 20 plays in the play-by-play data processing. This helps avoid issues with EPA/WPA models and improves data validation.
Update package version to 2.1.0. Add release notes for bug fixes in `cfbd_pbp_data()` and improvements to `add_yardage()` handling missing yardage values. Update cran-comments to reflect minor release and changes.
Added normalization for 'seasonType' to 'season_type' in cfbd_stats_game_advanced. Updated tests to check for column inclusion with expect_in instead of expect_setequal, and added team ID columns to betting lines test.
@vercel

vercel bot commented Jan 4, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Review Updated (UTC)
cfbfastr Ready Ready Preview, Comment Jan 12, 2026 5:20am

@coderabbitai
Contributor

coderabbitai bot commented Jan 4, 2026

📝 Walkthrough


Adds a new exported play-text preprocessing function clean_play_text(play_df), integrates it into the per-game PBP processing flow, switches yardage parsing to operate on cleaned_text with expanded regexes, updates defaults and tests, and bumps package version to 2.2.0.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Package metadata: DESCRIPTION | Version bumped 2.1.0 → 2.2.0; added patrick to Suggests. |
| Namespace & imports: NAMESPACE | Exported clean_play_text; added importFrom(stringr, str_replace). |
| Play-text cleaning & integration: R/cfbd_pbp_data.R, man/helpers_pbp.Rd | Added exported clean_play_text(play_df); wired clean_play_text() into the cfbd_pbp_data per-game pipeline before penalty detection; documented in man/helpers_pbp.Rd. |
| Yardage extraction logic: R/helper_pbp_add_yardage.R | add_yardage() now ensures and uses cleaned_text (added when absent), replaces play_text references with cleaned_text, and expands the regexes and fallbacks for rush/pass/receiving yard parsing (covers 2025 variants and edge cases). |
| Function defaults & validation: R/cfbd_games.R, R/cfbd_play.R, R/utils.R | Changed season_type defaults from "regular" → "both" in several APIs; widened validate_week() upper bound from 15 → 16 (error message not updated). |
| Tests: tests/testthat/test-cfbd_pbp_data.R, tests/testthat/test-cfbd_play_stats_types.R, tests/testthat/test-cfbd_play_types.R | Added parameterized 2025 yardage tests using patrick; updated expected row counts (cfbd_play_stats_types: 25 → 26, cfbd_play_types: 48 → 49). |
| Project & docs: cfbfastR.Rproj, NEWS.md, cran-comments.md | Project file reformatting only; added NEWS entries for v2.2.0 and CRAN comments reflecting the bug fix and default-parameter changes. |

Sequence Diagram(s)

sequenceDiagram
  participant cfbd_pbp_data as cfbd_pbp_data()
  participant clean_fn as clean_play_text()
  participant addyard as add_yardage()
  participant penalty as penalty_detection()
  participant epa as EPA/WPA_calc

  cfbd_pbp_data->>clean_fn: pass play_df (per game)
  clean_fn-->>cfbd_pbp_data: return play_df with cleaned_text
  cfbd_pbp_data->>addyard: pass play_df with cleaned_text
  addyard-->>cfbd_pbp_data: return play_df with yds_rushed/yds_receiving
  cfbd_pbp_data->>penalty: pass augmented play_df
  penalty-->>cfbd_pbp_data: return play_df post-penalty handling
  cfbd_pbp_data->>epa: compute EPA/WPA using processed play_df
  epa-->>cfbd_pbp_data: EPA/WPA results

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Poem

🐰 I hopped through messy play-by-play,
I nudged the times and places away.
Cleaned strings I stitched with regex art,
Now yard lines count and fields depart.
Hooray for tidy plays—hip hop hooray!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title 'More receiving yards parsing cases' accurately reflects the main focus of the changeset, which introduces expanded yardage extraction logic and additional parsing cases in the helper functions. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
R/helper_pbp_add_yardage.R (1)

44-56: Document the dependency on cleaned_text column.

The function now relies on a cleaned_text column (used starting at line 58) but doesn't validate its presence or document this requirement. Consider either:

  1. Adding parameter documentation that play_df must contain a cleaned_text column
  2. Adding a validation check at the start of the function
  3. Creating the column if it doesn't exist by calling clean_play_text()
🔎 Example validation check
 add_yardage <- function(play_df) {
+  if (!"cleaned_text" %in% names(play_df)) {
+    stop("play_df must contain a 'cleaned_text' column. Call clean_play_text() first.")
+  }
+  
   play_df$yds_rushed <- NA_real_
   play_df$yds_receiving <- NA_real_
🧹 Nitpick comments (1)
R/cfbd_pbp_data.R (1)

2168-2179: Consider refactoring for efficiency and completeness.

The function could be improved in several ways:

  1. Efficiency: Multiple str_replace calls that reassign to the same column are less efficient than chaining operations or using str_replace_all with multiple patterns.
  2. Whitespace handling: Consider adding str_trim() at the end to remove leading/trailing whitespace that may result from the replacements.
  3. Pattern consolidation: Some patterns could potentially be combined (e.g., the two "No Huddle" patterns on lines 2174-2175).
💡 Example refactored implementation
clean_play_text <- function(play_df) {
  # Build cleaned_text from play_text; play_text itself is left unchanged.
  play_df %>%
    dplyr::mutate(
      cleaned_text = .data$play_text %>%
        # leading "(MM:SS) " clock stamp
        stringr::str_replace("^\\(\\d{1,2}:\\d{2}\\)\\s+", "") %>%
        # pass depth and direction qualifiers
        stringr::str_replace("\\s\\b(short|deep)\\b\\s", " ") %>%
        stringr::str_replace("\\s\\b(left|middle|right)\\b\\s", " ") %>%
        # huddle and formation markers
        stringr::str_replace("\\s*No Huddle-Shotgun\\s+", "") %>%
        stringr::str_replace("No Huddle-?", "") %>%
        stringr::str_replace("\\s*Shotgun\\s+", "") %>%
        # collapse every run of whitespace, not just the first one
        stringr::str_replace_all("\\s+", " ") %>%
        stringr::str_trim()
    )
}
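
On the whitespace point: a first-match replacement collapses only the first run of whitespace, while a global replacement collapses all of them. The sketch below uses base R's sub() and gsub() (the counterparts of stringr::str_replace() and stringr::str_replace_all()) so it runs without extra packages. The sample string is made up for illustration and is not taken from real play-by-play data.

```r
# Hypothetical play text containing two separate runs of repeated spaces.
txt <- "H.King  pass complete   to J.Doe"

# sub() replaces only the first match, so the second run survives.
first_only <- sub("\\s+", " ", txt, perl = TRUE)

# gsub() replaces every match, so all runs are collapsed.
all_runs <- gsub("\\s+", " ", txt, perl = TRUE)

print(first_only)
print(all_runs)
```

If the cleaning step is meant to normalize all internal whitespace, the global form is the one that does so.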
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b80c73a and a4ff4e1.

📒 Files selected for processing (7)
  • DESCRIPTION
  • NAMESPACE
  • R/cfbd_pbp_data.R
  • R/helper_pbp_add_yardage.R
  • cfbfastR.Rproj
  • man/helpers_pbp.Rd
  • tests/testthat/test-cfbd_pbp_data.R
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-25T01:24:22.513Z
Learnt from: saiemgilani
Repo: sportsdataverse/cfbfastR PR: 113
File: R/cfbd_stats.R:381-381
Timestamp: 2025-08-25T01:24:22.513Z
Learning: In cfbfastR's cfbd_stats_season_advanced() function, the gsub("Opportunies", "_opportunities", colnames(df)) is intentional and correct for handling the actual API response column naming from CollegeFootballData.com, not a typo as it might appear.

Applied to files:

  • R/helper_pbp_add_yardage.R
  • R/cfbd_pbp_data.R
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: ubuntu-latest (release)
  • GitHub Check: windows-latest (release)
  • GitHub Check: ubuntu-latest (oldrel-1)
🔇 Additional comments (6)
DESCRIPTION (1)

61-61: LGTM! Appropriate dependency addition.

Adding patrick for parameterized testing aligns with the test updates mentioned in the PR summary.

NAMESPACE (1)

70-70: LGTM! Appropriate NAMESPACE updates.

The new export and import declarations correctly support the clean_play_text() function.

Also applies to: 152-152

man/helpers_pbp.Rd (1)

11-11: LGTM! Complete documentation for the new function.

The documentation properly describes clean_play_text() including its purpose, usage, and return values. The note about ESPN PBP changes in 2025 provides helpful context.

Also applies to: 26-26, 181-184, 342-345

R/cfbd_pbp_data.R (1)

598-598: LGTM! Appropriate pipeline placement.

Calling clean_play_text() before penalty_detection() ensures the text is cleaned early in the processing pipeline.

R/helper_pbp_add_yardage.R (1)

157-238: Inconsistent use of cleaned_text vs play_text across yardage extractions should be clarified.

The yds_rushed and yds_receiving calculations use cleaned_text (lines 58–150), while yds_int_return, yds_kickoff, yds_punted, yds_fumble_return, and yds_sacked still use play_text (lines 157–238).

The clean_play_text() function, which produces cleaned_text, removes pass depth/direction qualifiers (short/deep/left/middle/right), clock timestamps, and huddle/shotgun markers. These removals likely benefit rush and receiving patterns but may not affect special teams or interception patterns, which use different regex anchors. However, this design decision lacks documentation. Either:

  • Add a comment explaining why only rush/receiving need cleaned_text and others don't
  • Apply cleaned_text consistently across all yardage extractions for uniformity
  • Document in the function's roxygen comments which yardage types require text cleaning
tests/testthat/test-cfbd_pbp_data.R (1)

57-59: No action needed. The test filtering mechanism will work correctly.

The cfbd_pbp_data() function preserves the original play_text column. The clean_play_text() function creates a separate cleaned_text column via dplyr::mutate() without modifying play_text. The exact string matching on line 57 using the full timestamp strings like "(14:46) Shotgun #10 H.King..." will find the intended plays as expected.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a4ff4e1 and f9704f9.

📒 Files selected for processing (4)
  • R/helper_pbp_add_yardage.R
  • tests/testthat/test-cfbd_pbp_data.R
  • tests/testthat/test-cfbd_play_stats_types.R
  • tests/testthat/test-cfbd_play_types.R
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/testthat/test-cfbd_pbp_data.R
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-25T01:24:22.513Z
Learnt from: saiemgilani
Repo: sportsdataverse/cfbfastR PR: 113
File: R/cfbd_stats.R:381-381
Timestamp: 2025-08-25T01:24:22.513Z
Learning: In cfbfastR's cfbd_stats_season_advanced() function, the gsub("Opportunies", "_opportunities", colnames(df)) is intentional and correct for handling the actual API response column naming from CollegeFootballData.com, not a typo as it might appear.

Applied to files:

  • R/helper_pbp_add_yardage.R
📚 Learning: 2025-08-25T01:47:09.915Z
Learnt from: saiemgilani
Repo: sportsdataverse/cfbfastR PR: 113
File: R/cfbd_play.R:731-733
Timestamp: 2025-08-25T01:47:09.915Z
Learning: In the cfbfastR codebase, the pattern `tibble::tibble(data = .data$.)` in R/cfbd_play.R is valid and should not be flagged as incorrect usage of the .data pronoun.

Applied to files:

  • tests/testthat/test-cfbd_play_stats_types.R
  • tests/testthat/test-cfbd_play_types.R
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: windows-latest (release)
  • GitHub Check: ubuntu-latest (oldrel-1)
  • GitHub Check: ubuntu-latest (release)
🔇 Additional comments (5)
tests/testthat/test-cfbd_play_stats_types.R (1)

5-12: LGTM - Test expectation updated for new play stat type.

The row count expectation reflects an additional play stat type from the API. Note that this test is coupled to external data and may require future updates if the API schema changes.

R/helper_pbp_add_yardage.R (3)

55-58: Good defensive initialization of cleaned_text.

This fallback ensures the function works correctly whether or not the caller has pre-processed the data with clean_play_text().
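
A minimal sketch of this kind of fallback, in base R. Both helper names (clean_play_text_sketch, ensure_cleaned_text) and the sample row are hypothetical; the stand-in cleaner strips only the leading clock stamp and is not the package's actual clean_play_text().

```r
# Hypothetical stand-in for clean_play_text(): strips only a leading
# "(MM:SS) " clock stamp and writes the result to a new cleaned_text column.
clean_play_text_sketch <- function(play_df) {
  play_df$cleaned_text <- sub("^\\(\\d{1,2}:\\d{2}\\)\\s+", "",
                              play_df$play_text, perl = TRUE)
  play_df
}

# Add cleaned_text only when the caller has not already supplied it.
ensure_cleaned_text <- function(play_df) {
  if (!"cleaned_text" %in% names(play_df)) {
    play_df <- clean_play_text_sketch(play_df)
  }
  play_df
}

# Made-up play text, for illustration only.
raw <- data.frame(
  play_text = "(14:46) H.King pass complete to J.Doe for 7 yards",
  stringsAsFactors = FALSE
)
prepped <- ensure_cleaned_text(raw)

# A frame that already carries cleaned_text is passed through untouched.
pre_cleaned <- data.frame(play_text = "x", cleaned_text = "already cleaned",
                          stringsAsFactors = FALSE)
unchanged <- ensure_cleaned_text(pre_cleaned)
```

Because the cleaned string goes into a separate column, play_text stays available for exact-match filtering.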


62-113: Previous regex alternation issue resolved by restructuring.

The lookbehind patterns have been correctly split into separate branches for "run" and "rush" variants, eliminating the parsing ambiguity flagged in the previous review. Each pattern is now unambiguous (e.g., "(?<= run for a loss of)" vs "(?<= rush for a loss of)").
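
For illustration, the split-branch approach can be sketched in base R with perl = TRUE instead of the package's stringr calls. The helper name extract_loss_yards, the trailing space and \\d+ capture in each pattern, and the sample strings are assumptions; only the two lookbehind fragments come from the review.

```r
# Try each unambiguous lookbehind in turn and return the first number found.
# Returns the magnitude of the loss as written in the text, or NA if no
# branch matches.
extract_loss_yards <- function(text) {
  patterns <- c(
    "(?<= run for a loss of )\\d+",
    "(?<= rush for a loss of )\\d+"
  )
  for (p in patterns) {
    m <- regmatches(text, regexpr(p, text, perl = TRUE))
    if (length(m) == 1) {
      return(as.numeric(m))
    }
  }
  NA_real_
}

# Made-up play descriptions, for illustration only.
run_loss  <- extract_loss_yards("J.Smith run for a loss of 3 yards")
rush_loss <- extract_loss_yards("J.Smith rush for a loss of 12 yards")
no_match  <- extract_loss_yards("J.Smith pass complete to A.Lee for 5 yards")
```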


158-167: Verify regex pattern combination for PASSER pass edge case.

Lines 163-167 check for both "pass$" (end of string) AND "^to " (start of string) in the same cleaned_text. This seems contradictory unless clean_play_text() transforms the text in a way that makes both patterns matchable.

Could you verify this logic works as intended with sample 2025 play data? If cleaned_text cannot simultaneously end with "pass" and start with "to ", this branch will never match.
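
One quick way to check is to run both anchors against sample strings. The two patterns are not mutually exclusive in general, since a string can start with "to " and also end with "pass"; whether real cleaned_text values ever take that shape is the open question. The strings below are made up, and the check uses base R's grepl() rather than the package's code.

```r
# Hypothetical cleaned_text values, for illustration only.
samples <- c(
  "to J.Doe pass",          # starts with "to " and ends with "pass"
  "H.King pass",            # ends with "pass" only
  "to J.Doe for 5 yards"    # starts with "to " only
)

ends_with_pass <- grepl("pass$", samples)
starts_with_to <- grepl("^to ", samples)

# TRUE only where both anchors match the same string.
both_match <- ends_with_pass & starts_with_to
print(both_match)
```

Running the same two grepl() calls over a season of real cleaned_text values would show whether the branch is ever reached.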

tests/testthat/test-cfbd_play_types.R (1)

5-12: LGTM - Test expectation updated for new play type.

The row count expectation reflects an additional play type from the API. Same consideration as the play stats types test regarding external data coupling.

Changed the default value of season_type to 'both' in cfbd_game_info and its documentation. Updated week validation to allow weeks 1-16 instead of 1-15.
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
R/utils.R (1)

253-262: Error message doesn't match the updated validation range.

The range_check now validates weeks 1-16, but the error message on line 260 still references "1-15". This will confuse users who enter an invalid week value.

🔧 Proposed fix
     if(!all(checks)){
-      cli::cli_abort(glue::glue("Enter valid {deparse(substitute(week))} 1-15\n(14 for seasons pre-playoff, i.e. 2014 or earlier)"))
+      cli::cli_abort(glue::glue("Enter valid {deparse(substitute(week))} 1-16\n(14 for seasons pre-playoff, i.e. 2014 or earlier)"))
     }
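
One way to keep the message from drifting again is to build it from the same value used for the range check. The sketch below is a hypothetical base R helper (validate_week_sketch, max_week), not the package's validate_week(), and it uses stop() instead of cli::cli_abort() so it runs without extra packages.

```r
# Single source of truth for the upper bound (16 per this PR).
max_week <- 16

# Hypothetical validator: the error message is derived from `upper`,
# so widening the range updates the message automatically.
validate_week_sketch <- function(week, upper = max_week) {
  ok <- is.numeric(week) && isTRUE(all(week >= 1 & week <= upper))
  if (!ok) {
    stop(sprintf("Enter valid week 1-%d", upper), call. = FALSE)
  }
  invisible(week)
}

# Week 16 passes; week 17 fails with a message that names the current bound.
accepted <- validate_week_sketch(16)
rejected_msg <- tryCatch(
  validate_week_sketch(17),
  error = function(e) conditionMessage(e)
)
```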
🧹 Nitpick comments (2)
R/cfbd_play.R (1)

113-114: cfbd_plays() still defaults to "regular" while cfbd_play_stats_player() defaults to "both".

For consistency with the changes made to cfbd_play_stats_player() (line 282) and other functions mentioned in NEWS.md, consider updating cfbd_plays() to also default to "both". The documentation on line 59 would also need updating if this change is made.

♻️ Suggested change for consistency

Line 59 (documentation):

-#' @param season_type (*String* default regular): Season type - regular, postseason, both, allstar, spring_regular, spring_postseason
+#' @param season_type (*String* default both): Season type - regular, postseason, both, allstar, spring_regular, spring_postseason

Line 114 (function signature):

 cfbd_plays <- function(year = 2020,
-                       season_type = "regular",
+                       season_type = "both",
                        week = 1,
R/cfbd_games.R (1)

258-262: Consider aligning season_type defaults across related functions.

For awareness: cfbd_game_info() and cfbd_game_media() now default to "both", while cfbd_game_weather(), cfbd_game_player_stats(), and cfbd_game_team_stats() still default to "regular". If this inconsistency is intentional for this PR scope, no action needed—but you may want to align these in a follow-up for API consistency.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f9704f9 and 08bfb33.

📒 Files selected for processing (8)
  • DESCRIPTION
  • NEWS.md
  • R/cfbd_games.R
  • R/cfbd_play.R
  • R/utils.R
  • cran-comments.md
  • man/cfbd_game_info.Rd
  • man/cfbd_play_stats_player.Rd
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-25T01:45:22.587Z
Learnt from: saiemgilani
Repo: sportsdataverse/cfbfastR PR: 113
File: R/cfbd_metrics.R:330-334
Timestamp: 2025-08-25T01:45:22.587Z
Learning: In cfbfastR, apparent inconsistencies in position validation lists between different CFBD API endpoints (like "C" vs "OC") are often intentional due to different data providers backing different endpoints. Don't flag these as issues to fix unless there's clear evidence of a functional problem.

Applied to files:

  • NEWS.md
  • cran-comments.md
📚 Learning: 2025-08-25T01:24:22.513Z
Learnt from: saiemgilani
Repo: sportsdataverse/cfbfastR PR: 113
File: R/cfbd_stats.R:381-381
Timestamp: 2025-08-25T01:24:22.513Z
Learning: In cfbfastR's cfbd_stats_season_advanced() function, the gsub("Opportunies", "_opportunities", colnames(df)) is intentional and correct for handling the actual API response column naming from CollegeFootballData.com, not a typo as it might appear.

Applied to files:

  • NEWS.md
  • man/cfbd_play_stats_player.Rd
  • R/cfbd_play.R
  • DESCRIPTION
  • R/cfbd_games.R
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: ubuntu-latest (oldrel-1)
  • GitHub Check: windows-latest (release)
  • GitHub Check: ubuntu-latest (release)
🔇 Additional comments (7)
DESCRIPTION (1)

3-3: LGTM!

Version bump to 2.2.0 is appropriate for the new features and bug fixes. Adding patrick to Suggests is a good choice for parameterized testing support.

Also applies to: 61-61

cran-comments.md (1)

7-8: LGTM!

Release notes accurately summarize the validate_week() bug fix and documentation updates for the CRAN submission.

NEWS.md (1)

1-5: Release notes look good, but verify cfbd_plays() consistency.

The NEWS entry documents season_type default changes for cfbd_game_info() and cfbd_play_stats_player(). However, cfbd_plays() in R/cfbd_play.R still defaults to "regular" (line 114), while cfbd_play_stats_player() now defaults to "both".

Is this intentional, or should cfbd_plays() also be updated for consistency?

R/cfbd_play.R (1)

192-192: LGTM!

The cfbd_play_stats_player() function has been consistently updated with both documentation (line 192) and default parameter (line 282) changed to "both".

Also applies to: 282-282

man/cfbd_game_info.Rd (1)

10-10: LGTM!

The season_type default is updated to "both" consistently in both the usage signature and argument description. This aligns with the source change in R/cfbd_games.R line 132.

Also applies to: 25-25

R/cfbd_games.R (1)

70-70: LGTM!

The roxygen documentation (line 70) and function signature (line 132) are consistent in updating the season_type default to "both" for cfbd_game_info().

Also applies to: 132-132

man/cfbd_play_stats_player.Rd (1)

14-14: Documentation is consistent.

The season_type default of "both" is correctly applied in both the function signature (line 14) and the argument description (line 33). The source R/cfbd_play.R file has the matching roxygen2 documentation and function definition.

@saiemgilani saiemgilani self-assigned this Jan 12, 2026
@saiemgilani saiemgilani merged commit 33f5eef into main Jan 12, 2026
6 checks passed


3 participants